Stereo Vision Lip-tracking for Audio-video Speech Processing
Authors
Abstract
We present the first results from applying a recently proposed algorithm for the robust automatic extraction of lip feature points to an audio-video speech data corpus. The corpus comprises 10 native speakers uttering sequences that cover the range of phonemes and visemes in Australian English. The lip-tracking algorithm is based on stereo vision, which has the advantage that measurements are in real-world (3D) coordinates instead of image (2D) coordinates. Feature points on the inner lip contour, such as the lip corners and the mid-points of the upper and lower lip, are tracked automatically. Parameters describing the shape of the mouth are derived from these points. The results obtained so far show a correlation between the width and height of the mouth opening, as well as between the protrusion parameters of the upper and lower lips.
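As a minimal sketch of how shape parameters could be derived from the tracked points, the Python below computes the width and height of the mouth opening as Euclidean distances between 3D feature points and then measures their correlation across frames. The function names, the use of plain Euclidean distance, and the choice of the Pearson coefficient are assumptions made for illustration; the abstract does not state how the parameters or the correlation were computed, and the protrusion parameters are left out because they are not defined here.

```python
import math


def mouth_parameters(left_corner, right_corner, upper_mid, lower_mid):
    """Return (width, height) of the mouth opening.

    Each argument is an (x, y, z) point in real-world coordinates,
    as a stereo tracker would deliver. Hypothetical helper: the
    definitions below are assumptions, not the paper's.
    """
    width = math.dist(left_corner, right_corner)   # corner to corner
    height = math.dist(upper_mid, lower_mid)       # upper to lower lip mid-point
    return width, height


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up frames for illustration only (not data from the corpus).
    # Each frame: left corner, right corner, upper mid-point, lower mid-point.
    frames = [
        ((0, 0, 0), (50, 0, 0), (25, 5, 2), (25, -5, 2)),
        ((2, 0, 0), (48, 0, 0), (25, 9, 3), (25, -9, 3)),
        ((5, 0, 0), (45, 0, 0), (25, 14, 4), (25, -14, 4)),
    ]
    widths, heights = zip(*(mouth_parameters(*f) for f in frames))
    print("width/height correlation:", pearson(widths, heights))
```

Because the points are in 3D, the distances are in physical units and do not depend on how far the speaker is from the cameras, which is the advantage of stereo measurement that the abstract points to.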
Similar resources
3D Lip Tracking and Co-inertia Analysis for Improved Robustness of Audio-video Automatic Speech Recognition
Multimodality is a key issue in robust human-computer interaction. The joint use of audio and video speech variables has been shown to improve the performance of automatic speech recognition (ASR) systems. However, robust methods in particular for the real-time extraction of video speech features are still an open research area. This paper addresses the robustness issue of audio-video (AV) ASR s...
A Stereo Vision Lip Tracking Algorithm and Subsequent Statistical Analyses of the Audio-Video Correlation in Australian English
Human perception of the world is inherently multi-sensory because the information provided is multimodal. The perception of spoken language is no exception. Besides the auditory information, there is visual speech information as well, provided by the facial movements that result from moving the articulators during speech production. Visual speech information contributes to speech perception in all...
WAPUSK20 - A Database for Robust Audiovisual Speech Recognition
Audiovisual speech recognition (AVSR) systems have proven superior to audio-only speech recognizers in noisy environments by incorporating features of the visual modality. To develop reliable AVSR systems, appropriate simultaneously recorded speech and video data are needed. In this paper, we introduce a corpus (WAPUSK20) that consists of audiovisual data of 20 speakers utte...
Stereo 3D Lip Tracking
Gareth Loy, Roland Goecke, Sebastien Rougeaux and Alexander Zelinsky. Research School of Information Sciences and Engineering, Australian National University, Canberra 0200, Australia. Abstract: A system is presented that tracks in 3D a person's unadorned lips, and outputs the 3D locations of the mouth corners and ten points describing the outer li...